Footnote Identification within a PDF Document

LI Sida, GAO Liangcai, TANG Zhi, YU Yinyan

Acta Scientiarum Naturalium Universitatis Pekinensis 2015, 51 (6): 1017-1021. DOI: 10.13209/j.0479-8023.2015.087

A robust method is proposed for identifying footnotes in a PDF document and linking each footnote to its reference in the body text. Novel features of footnotes, including page layout, font information, and lexical and linguistic features, are used for the task. Clustering is adopted to handle features that vary across document types but are stable within a single document, so that the identification process adapts to the document type. In addition, the method feeds results from the matching process back to the identification process, which further improves accuracy. Preliminary experiments on real document sets show that the proposed method is promising for identifying footnotes in PDF documents.

A Supervised Dynamic Topic Model

JIANG Zhuoren,CHEN Yan,GAO Liangcai,TANG Zhi,LIU Xiaozhong

Acta Scientiarum Naturalium Universitatis Pekinensis

An innovative Supervised Dynamic Topic Model (S-DTM) is developed to overcome the limitations of traditional topic models. S-DTM models time-varying language dynamics and incorporates supervised learning by adding label restrictions to the variational inference of topics. This establishes a topic-label mapping and improves the interpretability of topics. Experiments are conducted on a Chinese journal paper corpus that spans twenty-five years and focuses mainly on natural language processing. The results show that, compared with a static supervised topic model and an unsupervised dynamic topic model, S-DTM offers better semantic interpretation, reflects the topic structure of a document more accurately, and captures the dynamic evolution of each topic's term distribution more precisely.

A Study on Classification of Forms with Similar Layout

WANG Simeng,GAO Liangcai,WANG Yuehan,LI Pingli,TANG Zhi

Research on Mathematical Formula Identification in Digital Chinese Documents

LIN Xiaoyan,GAO Liangcai,TANG Zhi

Acta Scientiarum Naturalium Universitatis Pekinensis

Unlike traditional formula identification methods designed for scanned images and Latin documents, the proposed method takes the characteristics of digital Chinese documents into account and identifies both isolated and embedded formulae using a combination of machine learning techniques and heuristic rules. Text line detection strategies and word segmentation rules are proposed for Chinese documents; effective features and machine learning algorithms for identifying formulae in Chinese documents are selected; and post-processing techniques, including text line or word merging, are proposed to overcome over-segmentation. Experimental results show that the proposed method achieves satisfactory results in identifying formulae in digital Chinese documents. Furthermore, a public Chinese document dataset is constructed to facilitate fair comparison between different formula identification methods.

Bit Allocation Algorithm for Joint Spatial-Temporal Scalabilities in H.264 SVC

PANG Yan,LIU Jiaying,GAO Liangcai,GUO Zongming

Acta Scientiarum Naturalium Universitatis Pekinensis

A bit allocation algorithm for joint spatial-temporal (S-T) scalability layers is proposed for H.264 SVC. Taking into account the characteristics of the SVC encoding procedure, a two-step model-based bit allocation scheme is developed. In each step, bit allocation along the spatial or temporal scalability dimension is formulated as an optimization problem. Rate-distortion (R-D) models for dependent spatial and temporal layers are then derived separately, with the complicated inter-layer dependencies taken into account. Finally, using the Lagrange multiplier method, the optimization problem is solved numerically with the derived R-D models. Experimental results show that the new R-D models yield a highly efficient bit allocation scheme that outperforms the JSVM benchmark by a significant margin, with an average coding gain of 1.22 dB.

Automatic Table Boundary Detection and Performance Evaluation in Fixed-Layout Documents

FANG Jing,GAO Liangcai,QIU Ruiheng,TANG Zhi

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors propose a novel table boundary detection method based on visual separators and geometric content layout information, which is effective for both Chinese and English documents. In addition, because of the lack of an automatic evaluation system for table boundary detection, the authors provide a publicly available large-scale dataset with ground truth, composed of equal numbers of Chinese and English pages, and propose performance measurements oriented toward mobile reading. Evaluation and comparison with two other open-source table boundary detection projects demonstrate the effectiveness of the proposed method and the practicality of the evaluation suite.

Chinese Textual Image Compression Based on Multi-feature Extraction

HU Kui,TANG Zhi,GAO Liangcai

Acta Scientiarum Naturalium Universitatis Pekinensis

Based on the characteristics of Chinese textual images, an improved compression algorithm, MC-JBIG2, is developed. First, multilevel feature data of Chinese characters are extracted; the data are then used in a cascaded clustering algorithm that replaces the pattern matching procedure of JBIG2. Experimental results show that MC-JBIG2 is highly efficient and improves the compression ratio without introducing substitution errors. Although designed for Chinese textual images, MC-JBIG2 also improves the compression ratio of English textual images.

An Approach to Auto-detection, Segmentation and Tagging of Bibliographic Metadata

GAO Liangcai,TANG Zhi,TAO Xin,FANG Jing

Acta Scientiarum Naturalium Universitatis Pekinensis

After reviewing existing methods for citation data extraction, the authors propose a new approach that relies on a common typesetting practice of bibliographies: the style consistency of citation data within the same document. The citation data detection and segmentation tasks, which have received less attention in previous research, are described. Furthermore, the authors take advantage of the style consistency of bibliographies to enhance citation metadata tagging. Experimental results show that the proposed method performs well in citation data detection, segmentation and tagging.

XTrim: An XML Compressor Based on XML Schema and Tiny Data Block Optimization

QIU Ruiheng,TANG Zhi,HU Wei,GAO Liangcai

Acta Scientiarum Naturalium Universitatis Pekinensis

The authors propose XTrim, an XML compressor based on XML Schema and tiny data block optimization, which minimizes the size of the structure in XML documents and improves the data grouping strategy by utilizing information in XML Schema. In particular, XTrim optimizes tiny data blocks in XML documents to achieve a higher compression ratio. Experimental results show that the proposed approach outperforms other compressors when handling XML documents.

A Table of Content Recognition Method of Book Documents Based on Clustering Techniques

GAO Liangcai,TANG Zhi,LIN Xiaofan,YU Yinyan,FANG Jing

Acta Scientiarum Naturalium Universitatis Pekinensis

After reviewing the merits and drawbacks of existing ToC (table of contents) recognition methods, the authors describe an automatic ToC recognition method with high efficiency and adaptability. Based on the style consistency of ToCs in book documents, this method employs clustering to detect decorative elements and to generate an adaptive ToC model, which can be used to extract ToC entries and their hierarchies. Experimental results show that this method achieves high accuracy and efficiency. In particular, it performs well on complicated ToCs with decorative elements, broken lines and various hierarchical structures. This method has been successfully applied in a commercial E-book production line.
